Abstract

One of the recent tasks for machine translation research has
been development of translation capabilities in a time frame as
short as 100 days. Such a task requires developers to consider
what can be done with relatively small amounts of data in a
small time frame. This inherently limits the type and complexity
of the effort to be devoted to this task. In this paper
we will focus on the kinds of improvements for a Farsi-to-English
translation system achieved by means of algorithmic
changes, adding raw, domain-unspecific resources, and unsupervised
morphological segmentation. The cumulative effect of
these measures has been an improvement in BLEU scores of
about 25% relative on an internal test set.

Index Terms: statistical machine translation, low-resource languages

1.  Introduction 

The advent of purely data-driven, statistical methods in machine
translation has made it possible to extract the human knowledge
implicit in large bodies of multilingual texts. Unlike earlier
purely knowledge-based approaches, which required the
painstaking adoption of linguistic knowledge in the form of explicit
rules, statistical machine translation (SMT) allows for the
development of a working system in a relatively short amount
of time. Since the amount of available training data plays a key
role in determining the performance of a statistically based system,
it is generally understood that the best way to improve a
system’s performance is to supply more training data.

For a given language pair, however, the amount of overall
available resources might be quite limited; hence, performance
improvement will have to be sought from other sources. One of
the tasks in the current DARPA TransTac program is the rapid
development of translation capabilities in a time frame as short
as 100 days. Such a task requires developers to consider what
can be done with relatively small amounts of data in a small
time frame. This inherently limits the type and complexity of
the effort to be devoted to this task. For instance, given the
available resources of the language in question, it might not be
practical to develop a reliable morphological analyzer for a particular
language in only three months, let alone a rule-based
translation engine such as the one for English to Iraqi Arabic
employed in SRI’s IraqComm system ([1]).

A case in point is Farsi, which was the surprise language
chosen for the 2007 TransTac task of developing translation
capabilities in at most 100 days. While SRI did not participate in
the evaluation itself, we took the opportunity to study the effect
of various methods for improving the performance of our SMT
system. In this paper we will focus on the kinds of improvements
achieved by means of (1) algorithmic changes, (2) adding
domain-unspecific resources, and (3) unsupervised morphological
segmentation. In particular, we were interested in methods
to improve translation quality that can be brought to bear
quickly and easily without any language-specific knowledge.
While the task encompasses all aspects of spoken-language
translation, we will concentrate here on the text-to-text part
only.

Table 1: Farsi data quantities

2.  Approach 

We describe a number of experiments conducted on the
TransTac Farsi data. After discussing the various initial preprocessing
steps applied to the data, we describe SRI’s statistical
machine translation system SRInterp. We subsequently
present results of the baseline system followed by various
extensions.

2.1.  Data 

The basis for the Farsi SMT system was the data supplied by
DARPA for the 2007 surprise language evaluation. From this
initial set, 85,400 nonempty aligned sentence pairs were set
aside for training. As an initial data cleaning step, pairs containing
ASR fragments (such as we tr- we try to provide ...) or
ASR “reject” symbols were eliminated. A summary of the resulting
data quantities is given in Table 1.

The next preprocessing step consisted of eliminating filled
pauses (e.g., %um), miscellaneous markup (e.g., %breath), and
punctuation symbols from the data. In addition, some of the
dialectal and orthographic variation in the Farsi data was normalized.
For instance, ‘his.name OBJ’ occurs either in the conventional
spelling AsmS rA or as AsmSv.¹ To this end, a set of
12,561 normalization mappings supplied by NIST was applied
to the data, which led to an almost 20% reduction of the vocabulary
size of the training set, as shown in Table 2.

1 All Farsi examples are given here in USCPers transliteration;
see [2].

Table 3: Baseline BLEU scores for standard phrase-based
Farsi-to-English SMT system

A corresponding set of 199 replacement rules was applied
to the English side, regularizing expressions such as don’t to do
not, but yielding only a 0.5% reduction in vocabulary size.

Finally, the Farsi data was transliterated from Persian script
to a purely ASCII-based format using the USCPers transliteration
scheme, cf. [2].
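The cleaning and normalization steps described in this section can be sketched roughly as follows. This is only an illustrative sketch, not the pipeline actually used: the reject symbol `%reject`, the rule that a fragment is a token ending in a hyphen, and the direction of the `AsmSv` mapping are assumptions, and the two mappings shown stand in for the NIST-supplied mapping set and the English replacement rules.

```python
import re
import string

# Hypothetical marker for ASR rejects; the actual symbol is not given in the text.
ASR_REJECT = "%reject"


def is_clean(sentence):
    """Return False if the sentence has an ASR fragment (e.g. 'tr-') or a reject symbol."""
    for tok in sentence.split():
        if tok == ASR_REJECT or (len(tok) > 1 and tok.endswith("-")):
            return False
    return True


def preprocess(sentence, mappings):
    """Drop %-prefixed markup and punctuation tokens, then apply normalization mappings."""
    kept = []
    for tok in sentence.split():
        if tok.startswith("%"):  # filled pauses and markup such as %um, %breath
            continue
        tok = tok.strip(string.punctuation)  # removes leading/trailing ASCII punctuation only
        if tok:
            kept.append(tok)
    text = " ".join(kept)
    for src, tgt in mappings.items():
        # Replace whole whitespace-delimited tokens only.
        text = re.sub(r"(?<!\S)" + re.escape(src) + r"(?!\S)", tgt, text)
    return text


def vocab_reduction(before, after):
    """Relative reduction in the number of distinct tokens after normalization."""
    v_before = {tok for s in before for tok in s.split()}
    v_after = {tok for s in after for tok in s.split()}
    return 1.0 - len(v_after) / len(v_before)


if __name__ == "__main__":
    english_rules = {"don't": "do not"}
    farsi_rules = {"AsmSv": "AsmS rA"}  # assumed direction: variant -> conventional spelling
    print(is_clean("we tr- we try to provide"))
    print(preprocess("%um we don't know , %breath", english_rules))
    print(preprocess("AsmSv", farsi_rules))
```

Punctuation stripping is shown here on whitespace-tokenized text; in the setup described above it precedes transliteration, so it would not interfere with ASCII symbols introduced by the USCPers scheme.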

2.2.  SRInterp  statistical  machine  translation  system 

The SRInterp engine is SRI’s SMT decoder ([3]), which supports
both standard phrase-based ([4]) and hierarchical phrase-based
translation methods ([5], [6]). The standard phrase-based
translation is based on a bilingual phrase-pair translation model.
Compared to earlier methods based on word-for-word translation,
phrase-based approaches are superior at memorizing
training data and are better at modeling local word reordering.
The standard phrase-based approach, however, cannot directly
model correspondences that involve long-distance relationships.
Such correspondences can pose serious problems for language
pairs with rather different word orders. By contrast, hierarchical
phrase-based translation is based on lexicalized synchronous
context-free grammars that are far superior at modeling long-distance
dependencies and hence provide a more principled approach
to word reordering. The improved ability to deal
with word order mismatches seems to be borne out in the difference
in performance between the phrase-based and the hierarchical
variants on the Farsi/English data, as shown in the next sections.

2.3.  Baseline  standard  phrase-based  system 

We first used the data described in Section 2.1 to train a standard
phrase-based system and evaluated it against the internal
test set. The performance of this baseline system as measured
in BLEU scores ([7]) is given in Table 3.

3.  Improvements 

Taking the standard phrase-based system as the point of departure,
we investigated strategies for improving the SMT performance.
These consisted of (1) changing the underlying translation
approach, (2) adding freely available but domain-unspecific
resources, and (3) applying morphological segmentation.

3.1.  Hierarchical  phrase-based  system 

The same data was used to train a hierarchical phrase-based
SMT system. We observed a noticeable improvement of 16.6%
relative in BLEU score over the phrase-based system, as shown
in Table 4.


               ||  BLEU score  |  Relative Improvement
Internal Test  ||  0.337       |  16.6%

Table 4: Baseline BLEU scores for hierarchical phrase-based
Farsi-to-English SMT system


               ||  BLEU score  |  Relative Improvement
Internal Test  ||  0.348       |  3.26%

Table 5: BLEU scores for hierarchical Farsi-to-English SMT
system, augmented with Shiraz dictionary


We believe that to a great extent these improvements can
be attributed to the differences in word order between English
and Farsi, which are more suitably handled by a hierarchical
phrase-based system. For instance, Farsi is an SOV language,
which means that the finite verb tends to occur later in the clause
than in the corresponding English sentence. This is illustrated
in the following sentence pair, in which English clause-medial
have is matched with Farsi clause-final dArm:


bih  yh   g#mAmh    Jdyd  dArm
yes  one  passport  new   i.have
‘yes I have a new passport.’
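To make the reordering concrete, the sentence pair above can be reproduced with a toy set of synchronous rules, each containing a single gap (X1). This is only a hand-written illustration of the idea behind hierarchical phrase-based translation: the rules and lexicon below are assumptions made up for this one sentence, and a real decoder such as SRInterp extracts weighted rules from aligned data and searches over many competing derivations rather than applying the first matching rule.

```python
# Each toy rule is (source prefix, source suffix, target template).
# The source side reads "prefix X1 suffix"; "X1" in the template marks
# where the translation of the gap is placed.
RULES = [
    (("bih",), (), ("yes", "X1")),              # bih X1     -> yes X1
    ((), ("dArm",), ("i", "have", "X1")),       # X1 dArm    -> i have X1
    (("yh",), ("Jdyd",), ("a", "new", "X1")),   # yh X1 Jdyd -> a new X1
]

# Word-level fallback for material not covered by any rule.
LEXICON = {"g#mAmh": "passport"}


def translate(tokens):
    """Translate by applying the first matching rule, recursing into the gap."""
    tokens = tuple(tokens)
    for prefix, suffix, template in RULES:
        gap_len = len(tokens) - len(prefix) - len(suffix)
        if gap_len < 1:
            continue  # the gap must cover at least one token
        starts = tokens[:len(prefix)] == prefix
        ends = tokens[len(tokens) - len(suffix):] == suffix
        if starts and ends:
            gap = tokens[len(prefix):len(prefix) + gap_len]
            out = []
            for sym in template:
                if sym == "X1":
                    out.extend(translate(gap))
                else:
                    out.append(sym)
            return out
    # No rule applies: translate word by word, copying unknown words.
    return [LEXICON.get(tok, tok) for tok in tokens]


if __name__ == "__main__":
    print(" ".join(translate("bih yh g#mAmh Jdyd dArm".split())))
```

The second rule is the one that moves the clause-final verb dArm in front of its object, which a phrase-based model can only do if the whole span fits inside a memorized phrase or within its reordering limit.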


In  the  next  sections  we  investigate  the  result  of  adding  more 
data  resources  and  applying  segmentation  into  subword  units. 

3.2.  Additional  data  sources 

In addition to the DARPA-supplied Farsi data, we considered
the extent to which other freely available online resources could
be utilized. To the best of our knowledge, the most extensive
previous computational investigation involving Farsi was
conducted within the Shiraz project at New Mexico State University
([8]).² Among the resources freely available from that
project is a bilingual dictionary containing 71,306 Farsi-English
entries (52,045 distinct Farsi entries, 23,639 of which are single
words). Since the entries are designed to work with the Shiraz
morphological analyzer, they are not necessarily in the form
most conducive to improving SMT quality. In particular, since
the Farsi entries are given in citation form (infinitive for verbs,
singular definite for nouns), the entries do not necessarily match
the inflected variants found in the training data. As a result, only
9,355 of the Farsi entries are actually found in the training data.
Moreover, in terms of adding translations for words not seen in
the original training data, the Shiraz dictionary contributes only
123 and 31 new Farsi entries to the coverage of the internal
and Eval test sets, respectively.³ At the same time, it is trivial
to add the Farsi-English translation pairs to the training data
and subject them to the same preprocessing steps as the original
training data.⁴ Without any further processing, this leads to a
relative improvement of 3.26% in BLEU scores over the original
hierarchical phrase-based system, as is shown in Table 5.
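The coverage counts discussed in this section, and the augmentation step itself, can be sketched as follows. This is a simplified illustration under stated assumptions: only single-word dictionary entries are matched against whitespace-separated tokens, the function names are hypothetical, and the entry `xyz` is a placeholder rather than a real Farsi word.

```python
def vocabulary(sentences):
    """Set of distinct whitespace-separated tokens in a list of sentences."""
    return {tok for s in sentences for tok in s.split()}


def dictionary_coverage(train_src, test_src, dictionary):
    """Count dictionary source entries seen in training, and entries that
    cover test words absent from training (single-word entries only)."""
    train_vocab = vocabulary(train_src)
    test_vocab = vocabulary(test_src)
    single_word = {src for src, _ in dictionary if len(src.split()) == 1}
    seen_in_training = single_word & train_vocab
    new_for_test = (test_vocab - train_vocab) & single_word
    return len(seen_in_training), len(new_for_test)


def augment(train_pairs, dictionary):
    """Append dictionary entries to the parallel data as extra sentence pairs."""
    return list(train_pairs) + list(dictionary)


if __name__ == "__main__":
    train = [("bih yh g#mAmh Jdyd dArm", "yes i have a new passport")]
    test_src = ["dvst dArm"]
    shiraz_like = [("dvst", "friend"), ("g#mAmh", "passport"), ("xyz", "placeholder")]
    print(dictionary_coverage([s for s, _ in train], test_src, shiraz_like))
    print(len(augment(train, shiraz_like)))
```

Because the dictionary entries become ordinary training pairs, they pass through the same preprocessing, alignment, and rule extraction as the rest of the data, which is what makes this step cheap to try.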
Table 6: Data statistics of MORFESSOR-induced segmentations
for various perplexity threshold (PPL) settings


3.3.  Unsupervised  morphological  segmentation 

One of the challenges for SMT is the fact that languages differ
in terms of what information content is packaged into individual
words. What gets expressed as a single word in one language
may correspond to a series of words in the other. Inasmuch as
it is possible to reduce such discrepancies, the quality of word
alignments is likely to improve. For instance, Farsi dvstm is
most naturally translated into English ‘my friend’; hence, one
way to achieve a closer correspondence between English and
Farsi word units is to split the Farsi expression into noun and
possessive pronoun segments:


dvst    dvstm      =>  dvst -m
friend  my friend
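Applying such a split to running text amounts to a token-level rewrite of the Farsi side before alignment. The sketch below assumes that a segmenter has already produced a word-to-morph table; the table contents and function name are hypothetical, and no particular segmentation tool's API is implied.

```python
# Hypothetical word -> morphs table, e.g. as produced by an unsupervised segmenter.
SEGMENTATIONS = {"dvstm": ["dvst", "-m"]}


def segment_sentence(sentence, segmentations):
    """Replace each token by its morphs; tokens without an entry stay unchanged."""
    out = []
    for tok in sentence.split():
        out.extend(segmentations.get(tok, [tok]))
    return " ".join(out)


if __name__ == "__main__":
    print(segment_sentence("dvstm", SEGMENTATIONS))
    print(segment_sentence("dvst dArm", SEGMENTATIONS))
```

After this rewrite the stem dvst is a token of its own in both the bare and the possessed form, so the aligner sees the same unit for ‘friend’ in either case.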


A reliable morphological analyzer for a new language usually
cannot be developed without considerable investment of effort
and linguistic expertise, which may not be available for a
rapid development task. This was in fact the case for Farsi,⁵ so
we turned to unsupervised methods of detecting subword units.
While such methods yield segmentations that often do not correlate
very well with more linguistically sound segmentations,
they have the obvious benefit of being applicable even if little
or nothing is known about the morphology of a new language.
Whether automatically derived segmentation leads to any improvement
over the segmentation-less baseline system can be
determined rather quickly and without having to settle on the
type of morphological segmentation to be used (i.e., what kinds
of morphemes of what kinds of lexical categories should be
separated).

To this end we adopted the MORFESSOR utility ([10]),⁶
specifically the MORFESSOR Categories-MAP algorithm ([11]),
as a way to obtain a segmentation of the data into morpheme-like
units (“morphs”). In particular, we used the training set to
train a segmentation model that was then applied to the remaining
data, allowing for the segmentation of unseen words on the
basis of the data seen earlier. One of the parameters affecting
MORFESSOR segmentation behavior is the perplexity threshold
(PPL), which, roughly speaking, regulates the aggressiveness
with which affixes are postulated. In addition to the default setting
of 10, we explored other values and found settings lower
than 10 to be more effective for this particular task.⁷ Table 6
shows the effect of MORFESSOR segmentations for different
PPL settings on the training set, with the column headed “None”
illustrating the original unsegmented data for comparison. In
general, higher PPL values correspond to a higher number of
segmented tokens. However, while MORFESSOR segmentation
leads to a 35-40% vocabulary reduction over the nonsegmented
texts, greater PPL numbers do not necessarily amount to smaller
vocabularies. At the same time, greater PPL numbers do result
in a smaller affix inventory used in the segmentation.

2 http://crl.nmsu.edu/Resources/lang_res/persian.html

3 All but nine of the new entries occur only once in either test set.

4 This method was inspired by [9], who used a raw bilingual dictionary
to improve the performance of their Cebuano system.

5 Attempts to utilize the Shiraz morphological analyzer remained
inconclusive.

6 www.cis.hut.fi/projects/morpho/morfessorcatmapdownloadform.shtml

Table 7 shows the results of various segmentations on
BLEU scores. At the MORFESSOR default PPL setting of 10,
there is essentially no improvement over the baseline. Better results,
however, can be found among the lower PPL settings, in particular
at the PPL setting of 4, which yields a relative improvement
of 2.0%.

The next experiment involved applying MORFESSOR-derived
segmentations to the original training set augmented
with the Shiraz dictionary. For systems trained on these sets we
observe improvements over the nonsegmented system for each
PPL setting. The best performance of 0.362 can be observed at
PPL=4, which constitutes a relative improvement of 2% over
the corresponding MORFESSOR-segmented system without the
Shiraz dictionary.

The cumulative effect of these measures is a relative improvement
over the original standard phrase-based system in
BLEU scores of about 25.33%. This system also performs quite
well when compared with those systems that participated in the
2007 evaluation. As is shown in Table 9, the performance of our
system on the offline data set is only about 1% relative worse than
that of the best-performing team.
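The relative figures quoted in this section follow from simple arithmetic on the BLEU scores. The sketch below recomputes them from the scores given in the text and tables. The phrase-based baseline score is not used directly; the value derived here is only what the reported 16.6% improvement implies, an assumption rather than a reported result, so the cumulative figure it yields can differ slightly from the published one through rounding.

```python
def relative_improvement(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Scores taken from the text and tables.
HIERARCHICAL = 0.337          # Table 4
HIERARCHICAL_SHIRAZ = 0.348   # Table 5
BEST_SYSTEM = 0.362           # hierarchical + Shiraz + MORFESSOR, PPL=4
TEAM_1_OFFLINE = 0.357        # Table 9
OURS_OFFLINE = 0.353          # Table 9

# Assumption: baseline implied by the reported 16.6% relative improvement.
implied_baseline = HIERARCHICAL / 1.166

if __name__ == "__main__":
    print(f"Shiraz over hierarchical: {relative_improvement(HIERARCHICAL, HIERARCHICAL_SHIRAZ):.2f}%")
    print(f"Implied phrase-based baseline: {implied_baseline:.3f}")
    print(f"Cumulative over implied baseline: {relative_improvement(implied_baseline, BEST_SYSTEM):.1f}%")
    print(f"Offline set vs. Team 1: {relative_improvement(TEAM_1_OFFLINE, OURS_OFFLINE):.1f}%")
```

Note that relative improvements compound multiplicatively, so the per-step percentages cannot simply be added to obtain the cumulative figure.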


7 Lagus and Creutz suggest that the proper setting is a function of
the data size and that larger training corpora require higher settings for
optimal performance.


Team 1                                    ||  0.357
Hierarchical + Shiraz + MORFESSOR, PPL=4  ||  0.353

Table 9: Performance on 2007 offline evaluation test set, compared
with best-performing team


SMT system                 ||  Eval test
Standard phrase-based      ||  0.216
Hierarchical phrase-based  ||  0.220
Hierarchical + Shiraz      ||  0.225

Table 10: BLEU scores for English to Farsi SMT systems


3.4.  English  to  Farsi 

While most of our effort has been focused on building and
improving a system for Farsi-to-English translation, a certain
amount of work was also devoted to the other direction. While
the absolute scores are considerably lower than for the Farsi-to-English
system, we see the same trends as before, i.e.,
improvements for the hierarchical phrase-based system as well
as for added Shiraz entries, as shown in Table 10. Experiments
involving MORFESSOR-induced segmentations for Farsi have so
far remained inconclusive.

4.  Conclusions 

As our results indicate, moving from standard to hierarchical
phrase-based SMT can result in significant performance improvements
for language pairs with different word order patterns.
Similarly, we have found that additional general-purpose
resources such as dictionaries can be helpful even without any
additional morphological adjustments. Finally, we have shown
that unsupervised methods of morphological segmentation can
indeed help improve the performance of an SMT system. This
finding contrasts with that of [12], who report no gain for using
fully automatically derived segmentations in SMT tasks involving
Nordic languages. Whether or not morphological segmentation
leads to any gain might of course depend on the particular
language pair chosen; at the same time we suspect that the
significantly larger data quantities (860,000 aligned sentences
reported in [12]) may also result in a diminished impact for
morphological segmentation.

In future work we intend to further explore the utility of
unsupervised segmentation methods. In particular, we intend
to compare our current results with segmentations applied to
both source and target sides. In addition, while MORFESSOR
utilizes a single parameter to regulate the segmentation of both
prefixes and suffixes, we conjecture that a more fine-grained approach
that deals with prefixes and suffixes independently might
be able to better match the morphological characteristics of a
given language.